
CARE-RAG - Clinical Assessment and Reasoning in RAG

Potluri, Deepthi, Mathew, Aby Mammen, DeWitt, Jeffrey B, Rasgon, Alexander L., Hao, Yide, Hong, Junyuan, Ding, Ying

arXiv.org Artificial Intelligence

Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.


Evaluating Large Language Models for Evidence-Based Clinical Question Answering

Wang, Can, Chen, Yiqun

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications, motivating rigorous evaluation of their ability to answer nuanced, evidence-based questions. We curate a multi-source benchmark drawing from Cochrane systematic reviews and clinical guidelines, including structured recommendations from the American Heart Association and narrative guidance used by insurers. Using GPT-4o-mini and GPT-5, we observe consistent performance patterns across sources and clinical domains: accuracy is highest on structured guideline recommendations (90%) and lower on narrative guideline and systematic review questions (60--70%). We also find a strong correlation between accuracy and the citation count of the underlying systematic reviews, where each doubling of citations is associated with roughly a 30% increase in the odds of a correct answer. Models show moderate ability to reason about evidence quality when contextual information is supplied. When we incorporate retrieval-augmented prompting, providing the gold-source abstract raises accuracy on previously incorrect items to 0.79; providing top 3 PubMed abstracts (ranked by semantic relevance) improves accuracy to 0.23, while random abstracts reduce accuracy (0.10, within temperature variation). These effects are mirrored in GPT-4o-mini, underscoring that source clarity and targeted retrieval -- not just model size -- drive performance. Overall, our results highlight both the promise and current limitations of LLMs for evidence-based clinical question answering. Retrieval-augmented prompting emerges as a useful strategy to improve factual accuracy and alignment with source evidence, while stratified evaluation by specialty and question type remains essential to understand current knowledge access and to contextualize model performance.
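The retrieval-augmented prompting the abstract describes (prepending the top-3 semantically ranked abstracts to the question) reduces to simple prompt assembly once relevance scores exist. A minimal sketch, assuming scores are computed elsewhere; the function name, example abstracts, and scores are illustrative, not from the paper.

```python
def build_rag_prompt(question: str, abstracts: list[str],
                     scores: list[float], k: int = 3) -> str:
    """Prepend the k highest-scoring abstracts to the question."""
    ranked = sorted(zip(abstracts, scores), key=lambda pair: pair[1], reverse=True)
    context = "\n\n".join(f"[{i + 1}] {text}"
                          for i, (text, _) in enumerate(ranked[:k]))
    return (f"Context abstracts:\n{context}\n\n"
            f"Question: {question}\n"
            f"Answer using only the context above.")

# Illustrative inputs: four candidate abstracts with hypothetical relevance scores.
prompt = build_rag_prompt(
    "Do statins reduce cardiovascular mortality in primary prevention?",
    ["Statin trial meta-analysis ...", "Aspirin dosing study ...",
     "Statin adherence survey ...", "Unrelated oncology abstract ..."],
    [0.91, 0.40, 0.77, 0.05],
    k=3)
```

The gold-source and random-abstract conditions in the paper correspond to swapping what goes into `abstracts`; the prompt format stays fixed, isolating retrieval quality as the variable.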


FHIR-RAG-MEDS: Integrating HL7 FHIR with Retrieval-Augmented Large Language Models for Enhanced Medical Decision Support

Kabak, Yildiray, Erturkmen, Gokce B. Laleci, Gencturk, Mert, Namli, Tuncay, Sinaci, A. Anil, Corcoles, Ruben Alcantud, Ballesteros, Cristina Gomez, Abizanda, Pedro, Dogac, Asuman

arXiv.org Artificial Intelligence

In recent years, the field of medical informatics has seen significant advancements with the introduction of medical large language models (LLMs). These models, powered by artificial intelligence, have demonstrated remarkable capabilities in understanding and generating medical text, providing valuable assistance in clinical decision-making, diagnostics, and patient care. Prominent examples include models such as Meditron [1], BioMistral [2] and OpenBioLLM [3], which have shown considerable promise in various medical applications. However, despite these advancements, the inherent limitations of medical LLMs highlight the need for more robust solutions.


Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models

Wu, Weiyi, Xu, Xinwen, Gao, Chongyang, Diao, Xingjian, Li, Siting, Salas, Lucas A., Gui, Jiang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) hold great potential in health care, yet they face substantial challenges in adapting to rapidly evolving medical knowledge. This can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios demonstrated difficulties in rejecting outdated recommendations and frequently endorsing conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most consistent and reliable results. These findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice. The dataset is available at https://huggingface.co/datasets/RDBH/DriftMed.


Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper

Mutisya, Fred, Gitau, Shikoh, Syovata, Christine, Oigara, Diana, Matende, Ibrahim, Aden, Muna, Ali, Munira, Nyotu, Ryan, Marion, Diana, Nyangena, Job, Ongoma, Nasubo, Mbae, Keith, Wamicha, Elizabeth, Mibuari, Eric, Nsengemana, Jean Philbert, Chidede, Talkmore

arXiv.org Artificial Intelligence

Large Language Models (LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care contexts remains under-explored. We present a rigorous methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2-3 (dispensary and health center) clinical care. Our approach leverages retrieval-augmented generation (RAG) to ground questions and answers in Kenya's national clinical guidelines, ensuring content aligns with local standard-of-care. The guidelines were digitised, chunked, and indexed for efficient semantic retrieval. Gemini Flash 2.0 Lite was then prompted with relevant guideline excerpts to generate realistic clinical questions, multiple-choice answers, and reasoning scenarios with source citations in English and Swahili. We engaged Kenyan physicians in a co-creation process to refine the dataset's relevance and fairness, and instituted a blinded expert validation pipeline to review for clinical accuracy, clarity, and cultural appropriateness. The resulting Alama Health QA dataset comprises thousands of regulator-aligned question-answer pairs spanning common outpatient conditions in English and Swahili. Beyond standard accuracy metrics, we propose innovative evaluation measures targeting clinical reasoning, safety, and adaptability. Initial results highlight significant performance gaps in state-of-the-art LLMs when confronted with localized scenarios, echoing recent findings that LLM accuracy on African medical questions lags behind performance on U.S. benchmarks. Our work demonstrates a pathway for dynamic, locally-grounded benchmarks that can evolve with guidelines, providing a crucial tool for safe and effective deployment of AI in African healthcare. Advances in large language models have spurred interest in their potential to augment medical services, especially in low- and middle-income countries facing clinician shortages (Bekbolatova et al., 2024). By handling routine queries or providing decision support, LLMs might help bridge gaps in healthcare access across Africa.
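The digitise, chunk, index, and retrieve pipeline this entry describes can be sketched minimally. This is an illustrative bag-of-words version, not the paper's implementation (which uses learned semantic embeddings); the guideline excerpt and chunk size below are hypothetical.

```python
from collections import Counter
import math

def chunk(text, size=40):
    """Split digitised guideline text into overlapping word windows."""
    words, step = text.split(), size // 2
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - step, 1), step)]

def vectorize(text):
    """Bag-of-words term counts, a stand-in for a semantic embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, k=3):
    """Return the k indexed chunks most similar to the query."""
    qv = vectorize(query)
    return sorted(index, key=lambda c: cosine(qv, vectorize(c)), reverse=True)[:k]

# Hypothetical guideline text standing in for the digitised corpus.
index = chunk(
    "Uncomplicated malaria in adults is treated with artemether-lumefantrine "
    "for three days. Blood pressure should be measured at every outpatient "
    "visit and hypertension managed per the national protocol.", size=20)
top = retrieve("first-line treatment for uncomplicated malaria", index, k=1)
```

In the paper's setup, the retrieved excerpts would then be passed to Gemini Flash 2.0 Lite as grounding context for question generation; overlapping windows keep recommendations that straddle a chunk boundary retrievable.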


Socially Constructed Treatment Plans: Analyzing Online Peer Interactions to Understand How Patients Navigate Complex Medical Conditions

Basak, Madhusudan, Sharif, Omar, Hulsey, Jessica, Saunders, Elizabeth C., Goodman, Daisy J., Archibald, Luke J., Preum, Sarah M.

arXiv.org Artificial Intelligence

When faced with complex and uncertain medical conditions (e.g., cancer, mental health conditions, recovery from substance dependency), millions of patients seek online peer support. In this study, we leverage content analysis of online discourse and ethnographic studies with clinicians and patient representatives to characterize how treatment plans for complex conditions are "socially constructed." Specifically, we ground online conversation on medication-assisted recovery treatment to medication guidelines and subsequently surface when and why people deviate from the clinical guidelines. We characterize the implications and effectiveness of socially constructed treatment plans through in-depth interviews with clinical experts. Finally, given the enthusiasm around AI-powered solutions for patient communication, we investigate whether and how socially constructed treatment-related knowledge is reflected in a state-of-the-art large language model (LLM). Leveraging a novel mixed-method approach, this study highlights critical research directions for patient-centered communication in online health communities.


Clean & Clear: Feasibility of Safe LLM Clinical Guidance

Ive, Julia, Jozsa, Felix, Jackson, Nick, Bondaronek, Paulina, Hill, Ciaran Scott, Dobson, Richard

arXiv.org Artificial Intelligence

Background: Clinical guidelines are central to safe evidence-based medicine in modern healthcare, providing diagnostic criteria, treatment options and monitoring advice for a wide range of illnesses. LLM-empowered chatbots have shown great promise in healthcare Q&A tasks, offering the potential to provide quick and accurate responses to medical inquiries. Our main objective was the development and preliminary assessment of an LLM-empowered chatbot software capable of reliably answering clinical guideline questions using University College London Hospital (UCLH) clinical guidelines. Methods: We used the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions. Our approach highlights the safety and reliability of referencing information over its interpretation and response generation. Seven doctors from the ward assessed the chatbot's performance by comparing its answers to the gold standard. Results: Our chatbot demonstrates promising performance in terms of relevance, with ~73% of its responses rated as very relevant, showcasing a strong understanding of the clinical context. Importantly, our chatbot achieves a recall of 0.98 for extracted guideline lines, substantially minimising the risk of missing critical information. Approximately 78% of responses were rated satisfactory in terms of completeness. A small portion (~14.5%) contained minor unnecessary information, indicating occasional lapses in precision. The chatbot showed high efficiency, with an average completion time of 10 seconds, compared to 30 seconds for human respondents. Evaluation of clinical reasoning showed that 72% of the chatbot's responses were without flaws. Our chatbot demonstrates significant potential to speed up and improve the process of accessing locally relevant clinical information for healthcare professionals.


Towards Conversational AI for Disease Management

Palepu, Anil, Liévin, Valentin, Weng, Wei-Hung, Saab, Khaled, Stutz, David, Cheng, Yong, Kulkarni, Kavita, Mahdavi, S. Sara, Barral, Joëlle, Webster, Dale R., Chou, Katherine, Hassidim, Avinatan, Matias, Yossi, Manyika, James, Tanno, Ryutaro, Natarajan, Vivek, Rodman, Adam, Tu, Tao, Karthikesalingam, Alan, Schaekermann, Mike

arXiv.org Artificial Intelligence

While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.


Expertise Is What We Want

Ashworth, Alan, Al-Dajani, Munir, Duchicela, Keegan, Kafadarov, Kiril, Kurian, Allison, Laraki, Othman, Lazrak, Amina, Mandair, Divneet, McKennon, Wendy, Miksad, Rebecca, Sanghvi, Jayodita, Zack, Travis

arXiv.org Artificial Intelligence

Clinical decision-making depends on expert reasoning, which is guided by standardized, evidence-based guidelines. However, translating these guidelines into automated clinical decision support systems risks inaccuracy and importantly, loss of nuance. We share an application architecture, the Large Language Expert (LLE), that combines the flexibility and power of Large Language Models (LLMs) with the interpretability, explainability, and reliability of Expert Systems. LLMs help address key challenges of Expert Systems, such as integrating and codifying knowledge, and data normalization. Conversely, an Expert System-like approach helps overcome challenges with LLMs, including hallucinations, atomic and inexpensive updates, and testability. To highlight the power of the Large Language Expert (LLE) system, we built an LLE to assist with the workup of patients newly diagnosed with cancer. Timely initiation of cancer treatment is critical for optimal patient outcomes. However, increasing complexity in diagnostic recommendations has made it difficult for primary care physicians to ensure their patients have completed the necessary workup before their first visit with an oncologist. As with many real-world clinical tasks, these workups require the analysis of unstructured health records and the application of nuanced clinical decision logic. In this study, we describe the design & evaluation of an LLE system built to rapidly identify and suggest the correct diagnostic workup. The system demonstrated a high degree of clinical-level accuracy (>95%) and effectively addressed gaps identified in real-world data from breast and colon cancer patients at a large academic center.


Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images

Cengiz, Sevim, Hamdi, Ibraheem, Yaqub, Mohammad

arXiv.org Artificial Intelligence

Fetal gestational age (GA) is vital clinical information that is estimated during pregnancy in order to assess fetal growth. This is usually performed by measuring the crown-rump-length (CRL) on an ultrasound image in the Dating scan which is then correlated with fetal age and growth trajectory. A major issue when performing the CRL measurement is ensuring that the image is acquired at the correct view, otherwise it could be misleading. Although clinical guidelines specify the criteria for the correct CRL view, sonographers may not regularly adhere to such rules. In this paper, we propose a new deep learning-based solution that is able to verify the adherence of a CRL image to clinical guidelines in order to assess image quality and facilitate accurate estimation of GA. We first segment out important fetal structures then use the localized structures to perform a clinically-guided mapping that verifies the adherence of criteria. The segmentation method combines the benefits of Convolutional Neural Network (CNN) and the Vision Transformer (ViT) to segment fetal structures in ultrasound images and localize important fetal landmarks. For segmentation purposes, we compare our proposed work with UNet and show that our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, we compare the output of the mapping with classification CNNs when assessing the clinical criteria and the overall acceptability of CRL images. We show that the proposed mapping is not only explainable but also more accurate than the best performing classification CNNs.